ABSTRACT

In this paper we introduce the first efficient external-memory 
algorithm to compute the bisimilarity equivalence classes of 
a directed acyclic graph (DAG). DAGs are commonly used 
to model data in a wide variety of practical applications, 
ranging from XML documents and data provenance models, 
to web taxonomies and scientific workflows. In the study of 
efficient reasoning over massive graphs, the notion of node 
bisimilarity plays a central role. For example, grouping to- 
gether bisimilar nodes in an XML data set is the first step 
in many sophisticated approaches to building indexing data 
structures for efficient XPath query evaluation. To date, 
however, only internal-memory bisimulation algorithms have 
been investigated. As the size of real-world DAG data sets 
often exceeds available main memory, storage in external 
memory becomes necessary. Hence, there is a practical need 
for an efficient approach to computing bisimulation in ex- 
ternal memory. 

Our general algorithm has a worst-case IO-complexity of
O(SORT(|N| + |E|)), where |N| and |E| are the numbers of
nodes and edges, resp., in the data graph and SORT(n) is
the number of accesses to external memory needed to sort 
an input of size n. We also study specializations of this 
algorithm to common variations of bisimulation for tree- 
structured XML data sets. We empirically verify efficient 
performance of the algorithms on graphs and XML docu- 
ments having billions of nodes and edges, and find that the 
algorithms can process such graphs efficiently even when 
very limited internal memory is available. The proposed 
algorithms are simple enough for practical implementation 
and use, and open the door for further study of external- 
memory bisimulation algorithms. To this end, the full open- 
source C++ implementation has been made freely available. 



1. INTRODUCTION 

Data modeled as directed acyclic graphs (DAGs) arise in
a diversity of practical applications such as biological and
biomedical ontologies [27], web folksonomies [24], scientific
workflows, semantic web schemas [I], business process
modeling [6] [11], data provenance modeling [21] [22], and the
widely adopted XML standard [I]. It is anticipated that the
variety, uses, and quantity of DAG-structured data sets will 
only continue to grow in the future. 

In each of these application areas, efficient searching and
querying on the data is a basic challenge. In reasoning over
massive data sets, typically index data structures are computed
and maintained to accelerate processing. These indexes
are essentially a reduction or summary of the underlying
data. Efficiency is achieved by performing reasoning
over this reduction to the extent possible, rather than
directly over the original data.

Many approaches to indexing have been investigated in pre- 
ceding decades. Reductions of data sets typically group to- 
gether data elements based on their shared values or sub- 
structures in the data. In graphs, the notion of bisimulation 
equivalence of nodes has proven to be an effective means 
for indexing (e.g., [TJ [13] [it] [16] [18] [20] [28]). Bisimulation,
which is a fundamental notion arising in a surprising range
of contexts [26], is based on the structural similarity of
subgraphs. Intuitively, two nodes are bisimilar to each other
if they cannot be distinguished from each other by the se- 
quences of node labels that may appear on the paths that 
start from these nodes, as well as from each of the nodes on 
those paths. Grouping bisimilar nodes is known as bisimu- 
lation partitioning. Blocks of bisimilar nodes are then used 
as the basis for constructing indexing data structures sup- 
porting efficient search and querying over the data. 

Efficient internal-memory solutions for computing bisimulation
partitions have been investigated (e.g., [12] [14] [23]). To
scale to real-world data sets such as those discussed above,
it becomes necessary to consider DAGs resident in external
memory. In considering algorithms for such data, the primary
concern is to minimize disk IO operations due to the
high cost involved, relative to main-memory operations, in
performing reads and writes to disk.

Due to the random access nature of internal-memory algorithms,
the design of external-memory algorithms which
minimize disk IO typically requires a significant departure
from approaches taken for internal-memory solutions [19]. In
particular, state-of-the-art internal-memory bisimulation
algorithms cannot be directly adapted to IO-efficient
external-memory algorithms due to their inherent random access
behaviour. While a study has been made on storing and querying
bisimulation partitions on disk [28], there has been to
our knowledge no approach developed to date for efficiently
computing bisimulation partitioning in external memory.

Motivated by these observations, in this paper we give the
first IO-efficient external-memory bisimulation algorithm for
DAGs. Our algorithm has a worst-case IO-complexity of
O(SORT(|N| + |E|)), where |N| and |E| are the numbers of
nodes and edges, resp., in the data graph and SORT(n) is
the number of accesses to external memory needed to sort
an input of size n. Efficiency is achieved by intelligent
organization of the graph on disk, and by sophisticated
processing of the graph using global and local reorganization and
careful staging and use of local bisimulation information.
We establish the theoretical efficiency of the algorithm, and
demonstrate its practicality via a thorough empirical
evaluation on data sets having billions of nodes and edges.

Our algorithm is simple enough for practical implementa- 
tion and use, and to serve as the basis for further study and 
design of external-memory bisimulation algorithms. For ex- 
ample, we also develop in this paper specializations of our 
algorithm for computing common variations of bisimulation 
for tree-structured graphs in the form of XML documents. 
Furthermore, the complete implementation is open-source 
and available for download. 

We proceed in the paper as follows. In the next section, we
present basic definitions concerning our data model,
bisimulation equivalence, and the standard external-memory
computational model. In Section 3 we then present and
theoretically analyze our external-memory bisimulation algorithm.
In Section 4 we show how to specialize our general algorithm
for various bisimulation notions proposed for XML data. In
Section 5 we then present a thorough empirical analysis of
our approach, and conclude in Section 6 with a discussion
of future directions for research.

2. PRELIMINARIES 

2.1 Graphs and bisimilarity 

In the context of this paper, a graph G is a triple G =
(N, E, l), where N is a finite set of nodes, E ⊆ N × N is
a directed edge relation, and l is a function with domain
N that assigns a label l(n) to every node n ∈ N. With a
slight abuse of terminology, we call n a child of m, and m
a parent of n, if and only if G contains an edge (m, n). Let
children(m) be the set of all children of m, and let parents(n)
be the set of all parents of n. Note that in our work we only
consider acyclic graphs. Furthermore, we assume that the
node set N is ordered in reverse topological order, that is,
children always precede their parents in the order. Assuming
a topological ordering is standard in the design of
external-memory DAG algorithms [19]. Indeed, real-world data is
often already ordered (e.g., XML documents), and, furthermore,
practical approaches to topological sorting of massive
data sets are available [3].



Definition 1. Let G1 = (N1, E1, l1) and G2 = (N2, E2, l2)
be two, possibly the same, graphs. Nodes n1 ∈ N1 and
n2 ∈ N2 are bisimilar to each other, denoted n1 ≈ n2, if
and only if:

1. the nodes have the same label: l1(n1) = l2(n2);

2. for every node n1' ∈ children(n1) there is a node n2' ∈
children(n2) such that n1' ≈ n2'; and

3. for every node n2' ∈ children(n2) there is a node n1' ∈
children(n1) such that n1' ≈ n2'.
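
As an illustration only, Definition 1 can be turned directly into a small in-memory check. The sketch below assumes the graph is given as two Python dictionaries, children and label (a hypothetical format, not the file layout used later in the paper), and relies on acyclicity for termination. It is a naive recursive test, not the external-memory algorithm developed in Section 3.

```python
from functools import lru_cache

def make_bisimilar(children, label):
    """Return a memoised predicate that applies Definition 1 on a DAG.

    children: dict mapping node -> list of child nodes (assumed format)
    label:    dict mapping node -> label (assumed format)
    """
    @lru_cache(maxsize=None)
    def bisimilar(a, b):
        # Condition 1: equal labels.
        if label[a] != label[b]:
            return False
        # Condition 2: every child of a has a bisimilar child of b.
        forth = all(any(bisimilar(x, y) for y in children[b])
                    for x in children[a])
        # Condition 3: every child of b has a bisimilar child of a.
        back = all(any(bisimilar(x, y) for x in children[a])
                   for y in children[b])
        return forth and back
    return bisimilar

# Small example: b1 has one c-child, b2 has two c-children.
children = {"c1": [], "c2": [], "b1": ["c1"],
            "b2": ["c1", "c2"], "a": ["b1", "b2"]}
label = {"c1": "c", "c2": "c", "b1": "b", "b2": "b", "a": "a"}
bisimilar = make_bisimilar(children, label)
```

In this example b1 and b2 are bisimilar even though they have different numbers of children, because bisimilarity only looks at the set of bisimilarity classes reachable among the children.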




Figure 1: Two bisimilar directed acyclic graphs. 

We can extend this notion to complete graphs as follows: 

Definition 2. Let G1 = (N1, E1, l1) and G2 = (N2, E2, l2)
be graphs. Graphs G1 and G2 are bisimilar to each other,
denoted as G1 ≈ G2, if and only if:

1. for every node n1 ∈ N1 there is a node n2 ∈ N2 such
that n1 ≈ n2; and

2. for every node n2 ∈ N2 there is a node n1 ∈ N1 such
that n1 ≈ n2.

Figure 1 shows two graphs that are bisimilar to each other.
The figure also shows with dotted lines how the nodes of 
one graph are bisimilar to nodes of the other graph. Note 
that in this figure all nodes with label a are bisimilar to each 
other, all nodes with label b are bisimilar to each other, and 
all nodes with label c are bisimilar to each other. Note, 
however, that this does not hold for nodes with label d. 

For each graph G there is a unique smallest graph (having
the fewest nodes) that is bisimilar to G; we call this smallest
graph the maximum bisimulation graph of G and denote it
by G↓ = (N↓, E↓, l↓). In Figure 1, the graph on the right
is the maximum bisimulation graph of itself. It is also the
maximum bisimulation graph of the graph on the left.

In the context of this paper, the bisimilarity index of a graph
G is a data structure that stores the maximum bisimulation
graph G↓ of G, and stores, for each node n↓ ∈ N↓, the set of
nodes {n ∈ N : n↓ ≈ n}, i.e., the bisimulation equivalence
class of n↓.

A partition 𝒫 of a graph G = (N, E, l) is a subdivision of
its nodes N into a set of blocks 𝒫 = {N1, N2, ...} such that
each block Ni ∈ 𝒫 is a subset of N, the blocks are mutually
disjoint, and their union is N. A bisimulation partition of a
graph G is a partition 𝒫 of G such that the blocks of 𝒫 are
exactly the bisimilarity equivalence classes of G.

A partition 𝒫1 is a refinement of a partition 𝒫2 if and only
if for every P1 ∈ 𝒫1 there is exactly one P2 ∈ 𝒫2 such that
P1 ⊆ P2.

The rank rank(n) of a node n is defined as the maximum
number of edges on any path that starts at n. It is easily
proved by induction that m ≈ n implies rank(m) = rank(n),
and thus the bisimulation partition is a refinement of the
partition by rank.



2.2 Analysis of external-memory algorithms 

In this paper, we investigate algorithms operating on data 
that does not fit in main memory. Therefore we need to use 
external memory, such as disks. In general external memory 
is slow. In particular, there is a high latency: it takes a lot of 
time to start reading or writing a random data item in exter- 
nal memory, but after that a large block that is consecutive 
in external memory can be read or written relatively fast. 
Thus the performance of algorithms using external memory 
is often dominated by the external-memory access patterns, 
where algorithms that read from and write to disk sparingly 
and in large blocks are at an advantage over algorithms that 
access the disk often for small amounts of data. 

We shall use the following standard computer model to analyze
the efficiency of our algorithms [5]. Our computer has
a fast memory with a limited size of M units of data, and 
a slow, external memory of practically unlimited size. The 
computer has a fast processing unit that can operate on 
data in fast memory, but not on data in external memory. 
Therefore, during operation of any algorithm on this com- 
puter, data needs to be transferred between the two memo- 
ries. This is done by moving data in blocks of size B; such 
a transfer is called an IO. The block size B is assumed to
be large enough that the latency is dominated by the actual 
transfer times, and thus, the time spent on external-memory 
access is roughly proportional to the number of IOs. 

The complexity of an external-memory algorithm can now
be expressed as the (asymptotic) number of IOs performed
by an algorithm, as a function of the input size and, possibly,
other parameters. Clearly, reading or writing n units of data
that are (to be) stored consecutively in external memory
takes Θ(SCAN(n)) = Θ(n/B) IOs. Sorting n units of data that
are consecutive in external memory takes Θ(SORT(n)) =
Θ((n/B) log_{M/B}(n/B)) IOs.
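
To give a feeling for these bounds, the following sketch evaluates the two expressions for concrete parameter values. The function names scan_ios and sort_ios are hypothetical, constant factors hidden by the Θ-notation are ignored, and the logarithm is clamped to at least one pass; the numbers are therefore rough estimates rather than exact IO counts.

```python
import math

def scan_ios(n, B):
    """Estimate of SCAN(n) = n/B: block transfers for n consecutive units."""
    return math.ceil(n / B)

def sort_ios(n, M, B):
    """Estimate of SORT(n) = (n/B) * log_{M/B}(n/B), ignoring constants.

    The number of merge passes is clamped to at least 1.
    """
    blocks = n / B
    if blocks <= 1:
        return 1
    passes = max(1.0, math.log2(blocks) / math.log2(M / B))
    return math.ceil(blocks * passes)

# Example: 2^30 data units, 2^20 units of fast memory, blocks of 2^10 units.
n, M, B = 2**30, 2**20, 2**10
scan_estimate = scan_ios(n, B)       # 2^20 block transfers
sort_estimate = sort_ios(n, M, B)    # 2^20 blocks times log_{2^10}(2^20) = 2 passes
```

With these parameters, sorting costs only about twice as many IOs as a single scan, which is why SORT(n) is regarded as an efficient bound in the external-memory model.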

3. BISIMULATION PARTITIONING 

State-of-the-art internal-memory bisimulation algorithms are
based on a process of refinement introduced in the work of
Paige and Tarjan (e.g., [12] [14] [23]). An initial partition of
the nodes is picked (for example: a partition based on label
equivalence). A step-by-step refinement of this initial partition
is calculated by picking a single block S of nodes from
the partition and stabilizing all other blocks B with respect
to this block (by splitting B into a block of nodes that have
children in S and a block of nodes that do not have
children in S).

These refinement steps require unstructured random access
to nodes and their children. In an external-memory setting,
these accesses translate to high IO costs. Therefore, it is not
clear that state-of-the-art internal-memory bisimulation
algorithms can be effectively adapted to an external-memory
setting. Hence, we have chosen to investigate an alternative
approach, inspired by the recent use of node rank to accelerate
internal-memory refinement computations [12] [14].



3.1 Outline of our approach 

Suppose each bisimilarity equivalence class is identified by
a unique number, the bisimilarity identifier. Let now the
bisimilarity family of a node be the set of bisimilarity
identifiers of its children, and let the bisimilarity decision value
of a node be the combination of its rank, its label, and its
bisimilarity family. Then, by Definition 1, all nodes in the
same bisimilarity equivalence class have the same bisimilarity
decision value, and each bisimilarity equivalence class is
uniquely identified by the bisimilarity decision value of its
nodes.

The main idea behind our algorithmic approach is now to 
match bisimilarity identifiers to nodes and their bisimilarity 
decision values, by processing the nodes in order of increas- 
ing rank. Thus, when processing the nodes of any rank r, 
the bisimilarity identifiers of the children of these nodes are 
already known and can be used to determine the bisimilar- 
ity decision values of the nodes of rank r, which can then be 
sorted in order to assign a unique identifier to each different 
bisimilarity decision value. 
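
The idea can be sketched in memory as follows. This is only an illustration under simplifying assumptions: the graph is given as Python dictionaries (a hypothetical format), ranks are computed recursively, and a dictionary replaces the sorting step that assigns identifiers to decision values. The external-memory realization is the subject of the remainder of this section.

```python
def bisim_ids(children, label):
    """Assign bisimilarity identifiers by processing nodes in rank order.

    children: dict node -> list of children; label: dict node -> label.
    Returns a dict node -> bisimilarity identifier (1, 2, 3, ...).
    """
    rank = {}

    def rk(n):
        # rank = longest path (in edges) starting at n; leaves have rank 0
        if n not in rank:
            rank[n] = max((rk(c) + 1 for c in children[n]), default=0)
        return rank[n]

    for n in children:
        rk(n)

    ident = {}   # node -> bisimilarity identifier
    seen = {}    # bisimilarity decision value -> identifier
    for n in sorted(children, key=lambda v: rank[v]):
        # children have strictly smaller rank, so their identifiers are known
        family = frozenset(ident[c] for c in children[n])
        decision_value = (rank[n], label[n], family)
        ident[n] = seen.setdefault(decision_value, len(seen) + 1)
    return ident

children = {"c1": [], "c2": [], "b1": ["c1"],
            "b2": ["c1", "c2"], "a": ["b1", "b2"]}
label = {"c1": "c", "c2": "c", "b1": "b", "b2": "b", "a": "a"}
ids = bisim_ids(children, label)
```

Here the two leaves share one identifier, b1 and b2 share another, and a receives a third, so the five nodes fall into three bisimilarity equivalence classes.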

To implement this approach, we use an algorithm in two 
phases. In the first phase, we compute the ranks of all nodes 
and we sort the nodes by rank and label; in the second 
phase, we obtain the bisimilarity family for each node and 
sort nodes of equal rank and label by their families. Below 
we will explain how these phases can be implemented to run 
in O(SORT(|N| + |E|)) IOs. After that, we will present an
enhanced version of the algorithm where, in the first phase, 
nodes are sorted not only by rank and label, but also by a 
recursively defined hash value. This enhancement leads to 
a small increase in cost of the first phase, but may result in 
a substantial reduction in the size of the sets of nodes that 
need to be sorted in the second phase. Thus the enhanced 
algorithm still takes O(SORT(|N| + |E|)) IOs, but with better
constant factors in practice for certain types of inputs. 

3.2 Time-forward processing 

Our bisimulation partitioning algorithm has two phases in 
which information about nodes must be computed from in- 
formation about their children: in the first phase, we need 
to compute a node's rank (which is one plus the maximum 
rank of its children); in the second phase, we need to as- 
sign bisimilarity identifiers to nodes based on the bisimilar- 
ity identifiers of their children. This would be relatively easy 
if we could access the children of any node n when we pro- 
cess n, but in an external-memory setting, this could cause 
many IOs. 

We can remove explicit access to the children of a node by in- 
troducing an IO-efficient supporting data structure that can 
be used to send information from children to parents. This 
technique is called time-forward processing [7] [19]. Time- 
forward processing can be used when nodes have unique or- 
dered node identifiers such that children have smaller identi- 
fiers than their parents, and the nodes are stored in order of 
their identifiers, each node being stored with its own identi- 
fier and those of its parents. The supporting data structure 
should support two operations: (i) inserting a message ad- 
dressed to a given node, identified by its node identifier, and 
(ii) inspecting and removing all messages addressed to the 
smallest node identifier that is currently present in the data 
structure. 

An algorithm that computes a value for each node depending
on the values of its children can now be implemented
as follows. We compute values for all nodes in order of
their identifier, and whenever we compute a node's value,
we insert messages with that value in the supporting data
structure, addressing these messages to each of the node's
parents. Thus, before we process each node n, we can obtain
the values computed for its children by extracting all messages
addressed to n from the data structure. Each node
removes all messages addressed to it from the data structure,
nodes with lower identifiers are processed before nodes
with higher identifiers, and no messages are ever addressed
to nodes that have already been processed; thus, when we
want to extract the messages addressed to n, these messages
will be the messages with the smallest node identifier
currently in the data structure and they can be extracted by
an operation of type (ii).

Algorithm 1 Phase 1: sort by rank and label

Input: file of nodes N as records (id, label), sorted by id;
  file of edges E as records (parent, child), sorted by child;
  (id(n), id(m)) ∈ E implies id(m) < id(n).
Output: file of nodes N' as records (id, origId, rank, label),
  sorted by id and, simultaneously, by (rank, label);
  file of edges E' as records (parent, child), sorted by child;
  rank(n) > rank(m) implies id(m) < id(n).

 1: create empty file Ranks of records (id, rank, label)
 2: create empty priority queue Q of records
    (id, childsRank), ordered by id
 3: for all (n, label) ∈ N, in order do
 4:   rank ← 0
 5:   while record at head of Q has id = n do
 6:     extract (n, childsRank) from Q
 7:     rank ← max(rank, childsRank + 1)
 8:   append (n, rank, label) to Ranks
 9:   while next edge from E has child = n do
10:     read (parent, n) from E
11:     insert (parent, rank) in Q
12: sort Ranks lexicographically by rank, label
13: copy Ranks to N' while assigning new node identifiers
14: copy E to E' while updating node identifiers in E'
15: sort E' by child
16: return (N', E')

The supporting data structure can be implemented as a priority
queue. There are external-memory priority queues
that, amortized over their life-time, perform k operations
of type (i) and (ii) in Θ(SORT(k)) IOs [I].
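
The following sketch shows time-forward processing for the rank computation of the first phase. It is a simplified in-memory illustration: Python lists stand in for the node and edge files, and the standard heapq module stands in for an external-memory priority queue; the function name ranks_time_forward is hypothetical.

```python
import heapq

def ranks_time_forward(nodes, edges):
    """Compute node ranks with time-forward processing.

    nodes: list of node ids in increasing order (children before parents)
    edges: list of (parent, child) pairs, sorted by child
    Returns a list of (id, rank) pairs in id order.
    """
    queue = []   # min-heap of messages (addressee id, rank of a child)
    out = []
    e = 0        # read pointer into the edge list
    for n in nodes:
        rank = 0
        # operation (ii): extract all messages addressed to n, which is
        # the smallest identifier currently present in the queue
        while queue and queue[0][0] == n:
            _, child_rank = heapq.heappop(queue)
            rank = max(rank, child_rank + 1)
        out.append((n, rank))
        # operation (i): send the rank of n to each of its parents
        while e < len(edges) and edges[e][1] == n:
            heapq.heappush(queue, (edges[e][0], rank))
            e += 1
    return out

# Diamond-shaped DAG: node 1 is a leaf, node 4 is the root.
nodes = [1, 2, 3, 4]
edges = [(2, 1), (3, 1), (4, 2), (4, 3)]
result = ranks_time_forward(nodes, edges)
```

Note that the loop never looks up the children of a node: each node only reads messages that its children sent earlier, and the node and edge lists are each scanned once from beginning to end.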

3.3 The bisimulation partitioning algorithm 

Assume the input to our problem consists of a list of nodes 
N, storing a unique node identifier and a label for every 
node, and a list of edges E, specified by the node identifiers 
of their tails (parents) and their heads (children). The list 
N is sorted by node identifier, and the list E is sorted by 
head (child). Recall that the node identifiers are assumed 
to be such that children always have smaller node identifiers 
than their parents. 

Our basic bisimulation partitioning algorithm is now as follows.
We use time-forward processing to compute the rank
of each node, and make a copy of the list of nodes in which
each node is annotated with its rank. Then we sort the nodes
lexicographically, with their ranks as primary keys and their
labels as secondary keys. We give each node a new identifier
which is simply the position of the node in the resulting
sorted list, and we replace the identifiers in E accordingly,
producing lists N' and E'. We sort these new lists by the
(new) node identifiers and by the (new) node identifiers of
the heads, respectively. This completes the first phase of
the algorithm. Pseudocode for this phase is given in
Algorithm 1.

Algorithm 2 Details of lines 13 and 14 of Algorithm 1

 1: newId ← 0
 2: create empty file R of records (origId, newId)
 3: create empty file N' of records
    (newId, origId, rank, label)
 4: for all (origId, rank, label) ∈ Ranks, in order do
 5:   newId ← newId + 1
 6:   append (origId, newId) to R
 7:   append (newId, origId, rank, label) to N'
 8: sort R by origId
 9: create empty file E' of records (parent, child)
10: move read pointer of E to beginning
11: for all (origId, newId) ∈ R, in order do
12:   while next edge from E has child = origId do
13:     read (parent, origId) from E
14:     append (parent, newId) to E'
15: sort E' by parent
16: move pointers of R and E' to beginning
17: for all (origId, newId) ∈ R, in order do
18:   while next edge from E' has parent = origId do
19:     read record (origId, child) from E' and
20:       overwrite with record (newId, child)

Some additional implementation details on the last lines of
Algorithm 1 are in order. We can copy Ranks to N' while
assigning new node identifiers, going through Ranks in order.
During this process we construct a list R of (old node
identifier, new node identifier)-pairs. To obtain a list E' with
updated child node identifiers, we scan E and R in parallel
from beginning to end, copying the entries of E to E' while
replacing the child node identifiers by the new identifiers as
read from R. To update the parent node identifiers in E'
we sort E' on parent node identifier; we then scan E' and R
in parallel from beginning to end while replacing the parent
node identifiers in E' by the new identifiers as read from R.
Pseudocode is given in Algorithm 2.
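
The renumbering can be sketched as two sort-and-scan passes. The code below is an in-memory illustration only: Python lists stand in for the files Ranks, R, E and E', the built-in sorted function stands in for external-memory sorting, and the helper name relabel is hypothetical.

```python
def renumber(ranks_sorted, edges):
    """Assign new identifiers and rewrite the edge list accordingly.

    ranks_sorted: list of (orig_id, rank, label), sorted by (rank, label)
    edges:        list of (parent, child) pairs using original identifiers
    Returns (new_nodes, new_edges), with new_edges sorted by child.
    """
    new_nodes = [(new, orig, r, l)
                 for new, (orig, r, l) in enumerate(ranks_sorted, start=1)]
    # R holds (old identifier, new identifier) pairs, sorted by old identifier
    R = sorted((orig, new) for new, orig, _, _ in new_nodes)

    def relabel(records, field):
        """Scan records (sorted on `field`) and R in parallel."""
        out, i = [], 0
        for rec in records:
            while R[i][0] != rec[field]:   # advance the read pointer of R
                i += 1
            rec = list(rec)
            rec[field] = R[i][1]
            out.append(tuple(rec))
        return out

    by_child = relabel(sorted(edges, key=lambda e: e[1]), 1)      # children
    by_parent = relabel(sorted(by_child, key=lambda e: e[0]), 0)  # parents
    return new_nodes, sorted(by_parent, key=lambda e: e[1])

# Nodes 10 and 20 are leaves; node 30 is their parent.
ranks_sorted = [(20, 0, "a"), (10, 0, "c"), (30, 1, "b")]
new_nodes, new_edges = renumber(ranks_sorted, [(30, 10), (30, 20)])
```

Each call to relabel moves through R strictly forward, which is what allows the real algorithm to replace random lookups of identifiers by sequential scans.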

The rank-label combinations of the nodes define a partitioning
of the graph. In the second phase of the algorithm,
we use time-forward processing to go through the blocks of
this partitioning one by one. Each rank-label combination
c is processed as follows. Let N_c be the set of nodes that
have rank-label combination c. For each node of N_c, we
extract the bisimilarity identifiers of its children from the
priority queue (assuming that they have been placed there)
and sort them, while removing duplicates. Thus we get the
bisimilarity families for all nodes of N_c. Then we sort the
nodes of N_c by bisimilarity family. Finally we go through the
nodes of N_c in order, assigning a unique bisimilarity identifier
bisimId(f) to each maximal group of nodes N_f within
N_c that have the same bisimilarity family f, and putting a
message bisimId(f) in the priority queue for all parents of
the nodes of N_f. Pseudocode for the second phase of the
algorithm is given in Algorithm 3.

Algorithm 3 Phase 2: sort by bisimilarity equivalence class

Input: file of nodes N' as records (id, origId, rank, label), sorted by id and, simultaneously, by (rank, label);
  file of edges E' as records (parent, child), sorted by child;
  rank(n) > rank(m) implies id(m) < id(n).
Output: file of nodes B as records (origId, bisimId)

 1: create empty file B of records (origId, bisimId)
 2: create priority queue Q of records (id, childsBisimId), ordered by id
 3: lastBisimId ← 0
 4: create empty file Group of records (bisimFamily, origId, parents)
 5: for all (n, origId, r, l) ∈ N', in order do
 6:   create an empty list bisimFamily
 7:   while record at head of Q has id = n do
 8:     extract (n, childsBisimId) from Q
 9:     append childsBisimId to bisimFamily
10:   sort bisimFamily, removing duplicates
11:   read all parents of n from E' and put them in a list parents
12:   append (bisimFamily, origId, parents) to Group
13:   if N' has no more records with rank = r, label = l then
14:     sort Group by bisimFamily, while marking the first occurrence of each family
15:     for all (bisimFamily, origId, parents) ∈ Group, in order do
16:       if bisimFamily is marked then
17:         lastBisimId ← lastBisimId + 1
18:       append (origId, lastBisimId) to B
19:       for all parentId ∈ parents do
20:         insert (parentId, lastBisimId) in Q
21:     erase Group
22: return B
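
A compact in-memory rendering of the second phase may help to follow the pseudocode. As before, this is a sketch under simplifying assumptions: lists stand in for the files N', E' and Group, heapq stands in for the external-memory priority queue, and the function name phase2 is hypothetical.

```python
import heapq
from itertools import groupby

def phase2(nodes, edges):
    """Assign bisimilarity identifiers block by block.

    nodes: list of (id, orig_id, rank, label), sorted by id and,
           simultaneously, by (rank, label)
    edges: list of (parent, child) over the new ids, sorted by child
    Returns a dict orig_id -> bisimilarity identifier.
    """
    queue, result, last_id, e = [], {}, 0, 0
    for _, block in groupby(nodes, key=lambda rec: (rec[2], rec[3])):
        group = []
        for n, orig, _, _ in block:
            family = set()
            while queue and queue[0][0] == n:      # messages from children
                family.add(heapq.heappop(queue)[1])
            parents = []
            while e < len(edges) and edges[e][1] == n:
                parents.append(edges[e][0])
                e += 1
            group.append((tuple(sorted(family)), orig, parents))
        group.sort(key=lambda g: g[0])             # sort Group by family
        previous = None
        for family, orig, parents in group:
            if family != previous:                 # first occurrence of family
                last_id += 1
                previous = family
            result[orig] = last_id
            for p in parents:                      # notify all parents
                heapq.heappush(queue, (p, last_id))
    return result

nodes = [(1, "c1", 0, "c"), (2, "c2", 0, "c"),
         (3, "b1", 1, "b"), (4, "b2", 1, "b"), (5, "a", 2, "a")]
edges = [(3, 1), (4, 1), (4, 2), (5, 3), (5, 4)]
classes = phase2(nodes, edges)
```

Messages are only inserted after a whole rank-label block has been sorted, which is safe because the parents of a node always have a strictly higher rank and therefore belong to a later block.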

Theorem 1. Given a labeled directed acyclic graph G =
(N, E, l) with its nodes numbered in (reverse) topological
order, we can compute the bisimilarity equivalence classes of
G in O(SORT(|N| + |E|)) IOs.

Proof. We use Algorithm 1 followed by Algorithm 3. As
observed in Section 2, bisimilar nodes must have the same
rank and the same label. As a result, any nodes that are
bisimilar to each other are processed in the same execution
of lines 14-21 of Algorithm 3. Based on the induction
hypothesis that nodes of rank at most r − 1 get the same bisimilarity
identifier if and only if they are bisimilar to each other, it is
now easy to show that in lines 14-21, nodes of rank r get the
same bisimilarity identifier if and only if they are bisimilar
to each other.

As for the efficiency of the algorithm, the first phase scans
and sorts files of at most |N| + |E| records, for a total of
O(SORT(|N| + |E|)) IOs. One record is inserted into and
extracted from the priority queue for each child-parent
relation; thus the total number of IOs required by the priority
queue is O(SORT(|E|)).

The second phase is slightly more involved, as it sorts the
lists bisimFamily and the files Group. For each node, the
list bisimFamily contains one entry for each edge originating
from that node; thus the total size of the lists bisimFamily
is O(|E|) and they are sorted in O(SORT(|E|)) IOs in total.
For each node n, one record is added to the file Group: this
record contains an identifier for n and each of its children
and parents. Thus the amount of data inserted into Group
over the course of the entire algorithm is O(|N| + |E|). On
line 14, the variable-size records in Group can be sorted and
marked with the string sorting algorithm by Arge et al. [5]
in O(SORT(|N| + |E|)) IOs. Thus, the complete algorithm
takes O(SORT(|N| + |E|)) IOs. □

3.4 Enhanced algorithm 

To reduce the amount of sorting needed in the second phase 
of the algorithm, we propose the following enhanced algo- 
rithm. In the first phase, we not only compute a rank for 
each node, but also a hash value, which is computed from 
the node's label and from the hash values of its children. 
Thus, the first phase of the algorithm is as in Algorithm 4.

The second phase of the algorithm is exactly as before, ex- 
cept that rank, label is replaced by rank, label, hash; in par- 
ticular, lines 14-21 are executed each time N' has no more 
records with the same rank, label, and hash value as the 
records seen so far. 

By induction on increasing rank one can prove that bisimilar
nodes get the same hash value, and therefore, any pair
of nodes that are bisimilar to each other will still be processed
in the same execution of lines 14-21 in Algorithm 3.
Thus, the algorithm is still correct. Note that if the hash
value from a label and any given set of k hash values can be
computed in O(SORT(k)) IOs, the complete algorithm also
still runs in O(SORT(|N| + |E|)) IOs in the worst case.

Algorithm 4 Phase 1 (enhanced with hash values)

Input: file of nodes N as records (id, label), sorted by id;
  file of edges E as records (parent, child), sorted by child;
  (id(n), id(m)) ∈ E implies id(m) < id(n).
Output: nodes N' as records (id, origId, rank, label, hash),
  sorted by id and, simultaneously, by (rank, label, hash);
  file of edges E' as records (parent, child), sorted by child;
  rank(n) > rank(m) implies id(m) < id(n).

 1: create empty file Ranks of records (id, rank, label, hash)
 2: create empty priority queue Q of records
    (id, childsRank, childsHash), ordered by id
 3: for all (n, label) ∈ N, in order do
 4:   rank ← 0
 5:   initialize empty list childrensHashes
 6:   while record at head of Q has id = n do
 7:     extract (n, childsRank, childsHash) from Q
 8:     rank ← max(rank, childsRank + 1)
 9:     add childsHash to childrensHashes
10:   sort childrensHashes, removing duplicates
11:   hash ← hash value from label and childrensHashes
12:   append (n, rank, label, hash) to Ranks
13:   while next edge from E has child = n do
14:     read (parent, n) from E
15:     insert (parent, rank, hash) in Q
16: sort Ranks lexicographically by rank, label, hash
17: copy Ranks to N' while assigning new node identifiers
18: copy E to E' while updating node identifiers in E'
19: sort E' by child
20: return (N', E')

In many practical settings the priority queues in the first
phase may be small and fit in main memory. For example,
if the input graph is a tree in reverse depth-first order, then
at any time during phase one, the queues will only contain
messages to/from the nodes on a single path between the
root and a leaf. As long as the children of a single node
always fit in memory, the hash values can be computed in
memory as well. Thus, the cost of computing the hash values
in the first phase is small, and in practice, each hash value
will be read or written at most eight times when writing,
sorting, and reading Ranks and N'. In return, the grouping
by rank, label, and hash value induces a much finer partitioning
of G than the grouping by rank and label only. As a
result, the sorting on line 14 of the second phase will be less
likely to require the use of external memory. Note that each
node's record in the Group file contains one number for the
node's original identifier, plus one number for each neighbour
of the node (a bisimilarity identifier for every child,
and a node identifier for every parent). Therefore, even if
on average, nodes have only two neighbours, a record in the
Group file has an average size of three numbers. Since sorting
out-of-memory would take at least two read passes and two
write passes, this would amount to the transfer of 3 · 4 = 12
numbers per node to or from disk. Thus, even in this setting
with few edges, the optimization with hashing may already
lead to IO-savings in the second phase of twelve numbers
per node.


3.5 Implementation using STXXL 

We have implemented the enhanced bisimulation partitioning
algorithm of Section 3.4 using the building blocks available
in the STXXL library, a mature open-source C++ library
which provides basic external-memory data structures
and algorithms [9].¹ Since STXXL does not include algorithms
to sort sets of variable-length records in external
memory, we used the following adaptation of Algorithm 3.

Instead of storing, for each node n, a record of the form
(bisimFamily, origId, parents) in the file Group, we store
the following fixed-size records: (i) one record of the type
(secondHash, origId), where secondHash is a secondary hash
value computed from the bisimilarity family of n; (ii) for
each child of n, a record of the type (secondHash, origId,
childsBisimId) (these records collectively store the bisimilarity
family of n); (iii) for each parent of n, a record of
the type (secondHash, origId, parentId) (these records
collectively store the parents of n). The secondary hash values
computed from the bisimilarity families are such that
bisimilarity families of different size always have different
secondary hash values.

In line 14, we sort the above-mentioned records lexicographically,
thus obtaining a list of nodes with their bisimilarity
families and parents, ordered by secondary hash value.
Although unlikely, collisions may occur: nodes with the same
secondary hash value (which appear consecutively in the
sorted list) may have different bisimilarity families.
Therefore we have to be a bit more careful when assigning
bisimilarity identifiers in lines 16-20: when processing nodes that
have the same secondary hash value, we record their
bisimilarity families with their bisimilarity identifiers in a
dictionary; before assigning a new identifier to a particular
bisimilarity family, we first check the dictionary to see if
an identifier for this bisimilarity family has already been
assigned. Considering that in practice (and, under the
assumption of perfect hashing, also in theory) the dictionary
is unlikely to ever be large, we used a simple sequential file
implementation for this dictionary. Bisimilarity families are
only stored in the dictionary as long as nodes with the same
secondary hash value are being processed; the dictionary is
always erased before proceeding to nodes with another rank,
label, hash value, or secondary hash value.
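The adaptation above can be illustrated with a small in-memory sketch. It is written in Python rather than C++/STXXL, an ordinary `sorted` call stands in for the external-memory sort, and a Python `dict` stands in for the sequential-file dictionary. The function `secondary_hash`, the extra record-kind field, and all other names are assumptions of this sketch; the text does not specify how the secondary hash is computed, only that families of different size must receive different values.

```python
from itertools import groupby
import zlib

# Record kinds: an assumption of this sketch, used to tell the three
# fixed-size record types apart after sorting.
HEADER, CHILD, PARENT = 0, 1, 2


def secondary_hash(family):
    """Hypothetical secondary hash of a bisimilarity family.

    The family size occupies the high bits, so families of different
    size can never share a hash value.
    """
    digest = zlib.crc32(",".join(map(str, family)).encode())
    return (len(family) << 32) | digest


def make_records(orig_id, child_bisim_ids, parent_ids):
    """Emit the fixed-size records (i)-(iii) for one node."""
    family = sorted(set(child_bisim_ids))
    h = secondary_hash(family)
    records = [(h, orig_id, HEADER, 0)]
    records += [(h, orig_id, CHILD, c) for c in family]
    records += [(h, orig_id, PARENT, p) for p in parent_ids]
    return records


def assign_bisim_ids(records, last_bisim_id=0):
    """Sort the records and assign identifiers; returns {origId: bisimId}."""
    result = {}
    records = sorted(records)  # stands in for the external-memory sort
    for _, same_hash in groupby(records, key=lambda r: r[0]):
        seen = {}  # family -> identifier; erased for every new hash value
        for orig_id, recs in groupby(same_hash, key=lambda r: r[1]):
            family = tuple(r[3] for r in recs if r[2] == CHILD)
            if family not in seen:  # collision check before a new identifier
                last_bisim_id += 1
                seen[family] = last_bisim_id
            result[orig_id] = seen[family]
    return result
```

Because the dictionary is keyed on the full family, two nodes that happen to share a secondary hash value but have different families still receive different identifiers.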

4. INDEXING XML DOCUMENTS 

XML documents are widely used to exchange and store
tree-structured data [1]. In this section we investigate
specializations of the general algorithm from the previous section to
efficiently calculate in external memory variations of
bisimulation which have been proposed in the design of indexing
data structures for XML and semi-structured databases. We
shall discuss two well-known variations, namely the 1-index
[20] and the A(k)-index [18]. We also briefly discuss how our
approach can be specialized to efficiently compute the
well-known F&B-index [1].

In Figure 2 we have given an example of a simple XML
document tree and indices built from this tree. In the index
figures, nodes represent partition blocks, and there exists an
edge from block A to block B if (and only if) there exists an
edge in the original document from a node in A to a node
in B.

1 For more information we refer to the STXXL project page
at http://stxxl.sourceforge.net/.

Figure 2: An XML document tree and five different index types; namely the 1-index, the F&B-index, and the A(k)-index (for
$0 \leq k \leq 3$). We have annotated each node in the XML document tree with a unique identifier. This identifier is used in the
indices to indicate the nodes represented by each index node.

4.1 The 1-index 

The 1-index utilizes "backward" bisimulation to relate nodes
with the same structure with respect to path-queries [20].
Backward bisimulation is equivalent to normal bisimulation
on a graph wherein all edges are reversed in direction. Figure
2(b) illustrates the 1-index of our example XML document.

Backward bisimulation combined with the nested tree-structure
of XML documents gives us several properties that we
can utilize to optimize bisimulation partitioning. Recall that
the basic algorithm from Section 3.3 consists of two phases:
in the first phase nodes are sorted by rank and label; in the
second phase nodes of the same rank and label are sorted
by bisimilarity family. Alternatively we could take the
following approach: in the first phase we sort by rank only;
in the second phase we sort nodes of the same rank by
label and bisimilarity family. Obviously, this would not
affect the correctness and the asymptotic I/O-complexity of
the algorithm. However, in the case of 1-indexes for
XML-documents it brings the following advantage: we can avoid
the use of a priority queue in the first phase, and we can
avoid use of several sorting passes to assign identifiers to
nodes. We achieve this as follows.

Instead of assigning identifiers to nodes by first sorting the
nodes by rank and then using the positions in the sorted list
as identifiers, we can use composite node identifiers of the
form (rank, idOnLevel), where rank is the backward rank
of the node (that is, the node's depth in the tree), and
idOnLevel is a unique identifier with respect to all nodes
with backward rank rank. We can now use the structure of
XML documents to compute these identifiers efficiently; in
particular we will exploit the fact that an XML document
essentially stores a so-called Euler tour [19] of the tree, in
order.

Our algorithm will traverse the tree while maintaining a
counter depth that holds the depth of the current position
in the tree, and an array count, in which the i-th number
(denoted count[i]) holds the number of nodes at depth i
encountered so far. Initially, depth = 0 and the array count
is empty; whenever we try to read an element of count that
does not exist yet, this element will be created and initialized
to zero.

Now, when we read a start-tag representing a node n during
the processing of an XML document, this node is assigned
backward rank rank = depth and idOnLevel = count[rank];
we increment both count[rank] and depth by one, and we
construct an edge to n from its parent: this must be the
last node encountered on the previous level, with composite
identifier (rank - 1, count[rank - 1] - 1). When we read an
end-tag we simply decrement depth by one. After reading
the complete tree, we simply sort the nodes by composite
identifier, and the edges by the composite identifiers of the
parents. Pseudocode for the complete first phase of the
algorithm is given in Algorithm 5.

The second phase of the algorithm is now simple to imple- 
ment. Note that we are computing backward bisimilarity 



Algorithm 5 XML 1-index, phase 1: sort by depth

Input: XML document D.
Output: file N' of XML nodes as records
          (rank, idOnLevel, origId, label),
        sorted by (rank, idOnLevel);
        file E' of edges as records
          (parentRank, parentIdOnLevel, childIdOnLevel),
        sorted by (parentRank, parentIdOnLevel);

 1: create empty file N'
 2: create empty file E'
 3: create empty array of counters count
 4: depth ← 0
 5: for all tags tag of D, in order do
 6:   if tag is a start tag then
 7:     if count[depth] does not exist then
 8:       add an entry count[depth] = 0 to count
 9:     if depth ≠ 0 then
10:       append (depth - 1, count[depth - 1] - 1, count[depth]) to E'
11:     determine node identifier origId and label label
12:     append (depth, count[depth], origId, label) to N'
13:     increment count[depth]
14:     increment depth
15:   else if tag is an end tag then
16:     decrement depth
17: sort N' by (rank, idOnLevel)
18: sort E' by (parentRank, parentIdOnLevel)
19: return (N', E')
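As an illustration, the first phase can be transcribed into a few lines of Python. This is only an in-memory sketch: the tag stream is assumed to be given as a list of tuples `("start", origId, label)` and `("end",)`, which is a representation chosen here for brevity, and the two final sorts stand in for external-memory sorts.

```python
def phase1(tags):
    """Sketch of Algorithm 5: returns (N', E') as sorted lists of tuples."""
    nodes, edges = [], []  # the files N' and E'
    count = []             # count[d] = number of nodes seen so far at depth d
    depth = 0
    for tag in tags:
        if tag[0] == "start":
            _, orig_id, label = tag
            if depth == len(count):  # count[depth] does not exist yet
                count.append(0)
            if depth != 0:
                # The parent is the last node seen on the previous level.
                edges.append((depth - 1, count[depth - 1] - 1, count[depth]))
            nodes.append((depth, count[depth], orig_id, label))
            count[depth] += 1
            depth += 1
        else:  # end tag
            depth -= 1
    nodes.sort(key=lambda rec: (rec[0], rec[1]))
    edges.sort(key=lambda rec: (rec[0], rec[1]))
    return nodes, edges
```

For a document of the shape `<a><b><c/></b><b><c/></b></a>`, with nodes numbered 1 to 5 in document order, both `b` nodes receive rank 1 and both `c` nodes receive rank 2, and every edge record names its parent by the composite identifier (rank, idOnLevel).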



index groups nodes based on a summary of their structure
with respect to both ancestors and descendants [1] [16].
Figure 2(c) illustrates the F&B-index of our example XML
document.

For trees, Grimsmo et al. have shown that the F&B-index
partitioning can be obtained by first computing forward
bisimulation and then refining the obtained partition by
computing backward bisimulation, i.e., by applying the
algorithm from Section 3 twice (once with edges reversed) [14].
It is possible to significantly reduce the cost of this
computation, by a straightforward adaptation of the algorithm from
Section 4.1 for the backwards bisimulation step [15].

4.2 The A(k)-index

The A(k)-index utilizes backward node k-bisimulation, a
localized variant of backward node bisimulation. The
A(k)-index groups nodes n based on the structure of ancestor
nodes at most k steps away from n.

Definition 3. Let $G = (N, E, l)$ be a graph, $m, n \in N$, and
$k \geq 0$. We say $m$ and $n$ are backward $k$-bisimilar, denoted
$n \approx^{k} m$, if and only if $k = 0$ and $l(n) = l(m)$, or $k > 0$ and:

1. the nodes $n$ and $m$ are backward $(k-1)$-bisimilar, that
is, $n \approx^{k-1} m$;

2. for each $n' \in \mathit{parents}(n)$, there is an $m' \in \mathit{parents}(m)$
with $n' \approx^{k-1} m'$; and

3. for each $m' \in \mathit{parents}(m)$, there is an $n' \in \mathit{parents}(n)$
with $n' \approx^{k-1} m'$.



equivalence classes, and therefore parents and children have
switched roles. Thus, the bisimilarity family of a node is
simply the bisimilarity identifier of the parent of a node,
and no implementations of string sorting or secondary hash
functions and dictionaries (as in Section 3.5) are needed.
Pseudocode is given in Algorithm 6.

Theorem 2. Given an XML-document of N nodes, we
can compute its 1-index in $O(\mathrm{SORT}(|N|))$ IOs.

Proof. We use Algorithm 5 followed by Algorithm 6.
The correctness of the algorithm follows from the same
arguments as for Algorithm 1 and Algorithm 3 in Theorem 1.

As for the IO-complexity, observe that the accesses to the file
count follow a very well-structured pattern: effectively we
move ahead in the file by one step whenever we encounter a
start tag, and we move back in the file by one step whenever
we encounter an end tag. Thus, if we keep the two most
recently accessed blocks of the file in memory, at least B
tags must be read between successive IOs on the count file.
Otherwise, the algorithm runs in $O(\mathrm{SORT}(|N| + |E|))$ IOs by
the same arguments as for Theorem 1; since $|E| = |N| - 1$,
this simplifies to $O(\mathrm{SORT}(|N|))$ IOs. □

4.1.1 The F&B-index

The 1-index summarizes the structure of graphs by only 
looking in one direction, from parent to child. The F&B- 



Figures 2(d)-(g) illustrate the A(0)-, A(1)-, A(2)-, and
A(3)-index, resp., of our example XML document.

The A(k)-index seems similar to the 1-index. However,
there is a critical difference between the two. Whereas all
backward bisimilar nodes have the same rank, this does
not necessarily hold for backward k-bisimilar nodes. We
thus cannot use backward rank to localize the partitioning
computations. We can, however, express backward node
k-bisimilarity on trees in another way; namely, in terms of
k-traces.

Definition 4. Let $G = (N, E, l)$ be a tree, $r \in N$ be the
root of $G$, $n \in N$, and $L(r, n) = (l(r), \ldots, l(n))$ be the
sequence of labels of the nodes on the path from $r$ to $n$ in $E$.
For $k > 0$, the $k$-trace of $L(r, n)$, denoted $T_n^{k}$, is the sequence
containing the last $k$ elements in $L(r, n)$. If $k > |L(r, n)|$,
the length of $L(r, n)$, then the $k$-trace is constructed by
prefixing $L(r, n)$ with $k - |L(r, n)|$ occurrences of some reserved
label A not in the range of $l$.

The k-traces, which are easily represented by fixed-size
values, are used for identifying backward k-bisimilar equivalent
nodes, as follows.

Proposition 1. Let $G = (N, E, l)$ be a tree, $m, n \in N$,
and $k \geq 0$. Then $n \approx^{k} m$ if and only if $T_n^{k+1} = T_m^{k+1}$.
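The proposition can be checked by brute force on a small tree. The Python sketch below assumes a tree given as a `parent` dictionary (the root has no entry) and a `label` dictionary; both representations, the function names, and the use of the string `"A"` as the reserved padding label are assumptions made for this illustration.

```python
def backward_k_bisimilar(parent, label, n, m, k):
    """Definition 3 specialised to trees (each node has at most one parent)."""
    if k == 0:
        return label[n] == label[m]
    if not backward_k_bisimilar(parent, label, n, m, k - 1):
        return False
    pn, pm = parent.get(n), parent.get(m)
    if pn is None or pm is None:
        # Conditions 2 and 3 hold vacuously only if both nodes are roots.
        return pn is None and pm is None
    return backward_k_bisimilar(parent, label, pn, pm, k - 1)


def trace(parent, label, n, k, pad="A"):
    """The k-trace of n: last k labels on the root-to-n path, padded in front."""
    path = []
    while n is not None:
        path.append(label[n])
        n = parent.get(n)
    path.reverse()  # labels from the root down to n
    padded = [pad] * max(0, k - len(path)) + path
    return tuple(padded[len(padded) - k:])


def proposition_holds(parent, label, max_k):
    """Compare both characterisations for all node pairs and all k <= max_k."""
    return all(
        backward_k_bisimilar(parent, label, n, m, k)
        == (trace(parent, label, n, k + 1) == trace(parent, label, m, k + 1))
        for k in range(max_k + 1)
        for n in label
        for m in label
    )
```

The recursive check is exponential in k and is meant only as a sanity test; the trace comparison is what makes the construction practical, since each node's class is determined by a single fixed-size value.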



Algorithm 6 XML 1-index, phase 2: sort by backward bisimilarity equivalence class

Input: file of XML nodes N' as records (rank, idOnLevel, origId, label), sorted lexicographically by (rank, idOnLevel);
       file of edges E' as records (parentRank, parentIdOnLevel, childIdOnLevel),
       sorted lexicographically by (parentRank, parentIdOnLevel);
Output: file of XML nodes B as records (origId, bisimId)

 1: create empty file B of records (origId, bisimId)
 2: create priority queue Q of records (rank, idOnLevel, parentBisimId), ordered by (rank, idOnLevel)
 3: insert (-1, 0, 0) in Q (sentinel for root)
 4: lastBisimId ← 0
 5: create empty file Group of records (label, parentBisimId, origId, children)
 6: for all (r, idOnLevel, origId, label) ∈ N', in order do
 7:   extract (r, idOnLevel, parentBisimId) from Q
 8:   create an empty list children
 9:   while next record of E' has parentRank = r and parentIdOnLevel = idOnLevel do
10:     read (parentRank, parentIdOnLevel, childIdOnLevel) from E'
11:     append childIdOnLevel to children
12:   append (label, parentBisimId, origId, children) to Group
13:   if N' has no more records with rank = r then
14:     sort Group lexicographically by (label, parentBisimId)
15:     for all (label, parentBisimId, origId, children) ∈ Group, in order do
16:       if label and parentBisimId are not the same as in previous record of Group then
17:         lastBisimId ← lastBisimId + 1
18:       append (origId, lastBisimId) to B
19:       for all childIdOnLevel ∈ children do
20:         insert (r + 1, childIdOnLevel, lastBisimId) in Q
21:     erase Group
22: return B
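The second phase can be sketched in Python in the same spirit. The sketch below is an in-memory illustration only: `heapq` stands in for the external-memory priority queue, Python lists stand in for the files, and the input is assumed to be the pair of sorted record lists described in the Input clause of Algorithm 6.

```python
import heapq


def phase2(nodes, edges):
    """Sketch of Algorithm 6: returns the file B as a list of (origId, bisimId)."""
    bisim = []
    queue = [(-1, 0, 0)]  # sentinel that supplies the root's parentBisimId
    last_bisim_id = 0
    group = []
    e = 0                 # read position in E'
    for i, (r, id_on_level, orig_id, label) in enumerate(nodes):
        _, _, parent_bisim_id = heapq.heappop(queue)
        children = []
        while e < len(edges) and edges[e][:2] == (r, id_on_level):
            children.append(edges[e][2])
            e += 1
        group.append((label, parent_bisim_id, orig_id, children))
        # Last record with rank r: partition the whole level at once.
        if i + 1 == len(nodes) or nodes[i + 1][0] != r:
            group.sort(key=lambda g: (g[0], g[1]))
            previous = None
            for lab, pb, oid, kids in group:
                if (lab, pb) != previous:
                    last_bisim_id += 1
                    previous = (lab, pb)
                bisim.append((oid, last_bisim_id))
                for child in kids:
                    heapq.heappush(queue, (r + 1, child, last_bisim_id))
            group = []
    return bisim
```

Note that the queue only ever holds records for the level currently being read, which is why extracting the minimum always yields the record belonging to the current node.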



While processing an XML document, we can use a stack to
store the labels of the current node n and all of its ancestors
by pushing the label of a node onto the stack when we encounter
a start-tag and popping the top of the stack when we
encounter an end-tag. By taking the topmost k + 1 elements
we get $T_n^{k+1}$, the (k + 1)-trace of n. This leads to a simple
A(k)-index construction algorithm for XML documents, a
sketch of which is presented in Algorithm 7. The algorithm
has a worst-case IO-complexity of $O(\mathrm{SORT}(k|N|))$ IOs.
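A minimal Python sketch of this stack-based construction follows. The tag stream format (tuples `("start", origId, label)` and `("end",)`), the helper `blocks`, and the use of the string `"A"` as the dummy label are assumptions of the sketch; the in-memory sort stands in for an external-memory sort, and document labels are assumed never to equal the dummy label.

```python
from itertools import groupby


def ak_index(tags, k, pad="A"):
    """Return records (origId, trace) sorted by trace, for the A(k)-index."""
    stack = [pad] * k  # k dummy labels, so the root also has a full trace
    records = []
    for tag in tags:
        if tag[0] == "start":
            _, orig_id, label = tag
            stack.append(label)
            # Topmost k + 1 labels form the (k + 1)-trace of this node.
            records.append((orig_id, tuple(stack[-(k + 1):])))
        else:  # end tag
            stack.pop()
    records.sort(key=lambda rec: rec[1])
    return records


def blocks(records):
    """Group the sorted records into partition blocks of equal trace."""
    return [[orig_id for orig_id, _ in grp]
            for _, grp in groupby(records, key=lambda rec: rec[1])]
```

Since every record carries k + 1 labels, the sorted file has size proportional to k|N|, which matches the role of the factor k in the stated IO-complexity.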




were performed on a standard laptop with an Intel Core i5- 
560M processor and 4GB of main memory. We have used 
the internal hard disk drive of this system for sorting and for 
storing temporary data structures such as priority queues.




Algorithm 7 A(k)-index construction for XML documents

Input: XML document D.
Output: file of XML nodes B as records (origId, trace)

 1: create empty file B of records (origId, trace)
 2: create empty stack S
 3: push k dummy labels A onto S
 4: for all tags tag of D, in order do
 5:   if tag is a start tag then
 6:     determine node identifier origId and label label
 7:     push label onto S
 8:     append (origId, top k + 1 elements of S) to B
 9:   else if tag is an end tag then
10:     pop one label from S
11: sort B by trace
12: return B

For this experiment, random graphs having between
$100 \cdot 10^6$ and $1000 \cdot 10^6$ nodes were created using the generator
described in Appendix A. Every graph had an average of
three to four edges per node. The file size of the input
graphs ranged between 2.1GB and 21.2GB. The number of
bisimulation partition blocks in the output ranged from
$70 \cdot 10^6$ for the smallest graph to $708 \cdot 10^6$ for the largest graph.
For the largest input we have measured a total of 35919 reads
from disk and 35091 writes to disk. In total, 70.1GB was
read and 68.5GB was written. These measurements only
include temporary file usage (priority queues and sorting);
not the reading of input and writing of output. In Figure 3




2 Open-source code of the full C++ implementation of the 
algorithms and supporting tooling used in our analysis can 
be found at http://jhellings.nl/projects/exbisim/. 
Figure 3: Performance of the bisimulation algorithm from
Section 3 (Experiment 1). On the left, running time per
node and edge is plotted against the number of nodes in the
input. On the right, the number of IOs performed per node
and edge is plotted against the number of nodes in the input.
The subscript rl indicates an initial partition based on rank
and label, the subscript ss indicates an initial partition based
on rank, label, and hash value.

Figure 5: Impact of available internal memory on the
performance of the bisimulation algorithm from Section 3
(Experiment 3). On the left, running time per node and edge
is plotted against the amount of available memory. On the
right, the number of IOs performed per node and edge is
plotted against the amount of available memory. Note that
the amount of available memory excludes stack space used
by local variables and the memory used for buffers (256MB
in total).

Figure 4: Performance of the bisimulation algorithm from
Section 3 (Experiment 2). On the left, running time per
node and edge is plotted against the number of edges in the
input. On the right, the number of IOs performed per node
and edge is plotted against the number of edges in the input.



we have plotted the results for this experiment. 

Experiment 2. In this experiment we measured the
performance of the general bisimulation algorithm from Section 3.5
as a function of the number of edges in the input graph. To
this end, we created graphs having $5 \cdot 10^4$ nodes and between
0 and $1249 \cdot 10^6$ edges, using the generator as described in
Appendix A. In Figure 4 we have plotted the results for this
experiment.

Experiment 3. In this experiment we measured the
performance of the general bisimulation algorithm of Section 3.5
on a single graph as a function of the amount of available
memory (per data structure). For this experiment we fixed a
single graph with $10^8$ nodes and $3.3 \cdot 10^8$ edges, generated as
described in Appendix A. On this graph we performed
external memory bisimulation partitioning, using versions of
the algorithm from Section 3 constrained to a limited
memory usage. We used values between 12 MB and 1.5 GB for
the amount of memory the algorithm is allowed to use. In
Figure 5 we have plotted the results for this experiment.

Figure 6: Comparing the performance of the DAG
bisimulation algorithm of Section 3 and the 1-index algorithm of
Section 4.1 (Experiment 4). On the left, running time per
node is plotted as a function of the scaling factor. On the
right, the number of IOs performed per node is plotted as a
function of the scaling factor.

Experiment 4. In this experiment we compared the
performance of, on one hand, the general DAG bisimulation
algorithm from Section 3.5 and, on the other hand, the
specialized algorithm from Section 4.1 for 1-index construction
on XML documents. The performance of both algorithms
is measured as a function of the size of the input graph.
For this experiment we created XML documents using the
xmlgen program provided by the XML Benchmark Project.³
For the generation of XML documents we have used scaling
factors between 50 and 500, resulting in documents with
sizes between 5.6GB ($10^8$ nodes) and 55.8GB ($10^9$ nodes).
In Figure 6 we have plotted the results for this experiment.

3 We have used version 0.92 of xmlgen, see
http://www.xml-benchmark.org/ for details.



Analysis of results. The experiments all show that under
all tested conditions, the general algorithm from Section 3
is and stays IO-efficient, even when available memory is
artificially limited or when the number of nodes and edges is
very high. We also see from Experiment 4 that
specializations of our algorithm can outperform the general algorithm
by a good margin. In particular, we were able to
process a 55.8 GB XML document of $10^9$ nodes, generated by
software from the XMark XML benchmark project, in 104
minutes on a standard laptop with a standard hard disk.



querying RDF graphs and general graph databases, cycles in 
the data are common. Looking at the current state of general
external-memory graph algorithms [19], it is not clear
that solutions for IO-efficient bisimulation partitioning on 
cyclic graphs are likely to exist. One can however focus 
research on heuristic approaches to achieve acceptable per- 
formance in many cases, as is common for many external- 
memory algorithms (e.g., [3]). Extending our algorithms 
with such heuristics to handle cycles is an interesting direc- 
tion for further study. 



Experiment 1 further shows that a good initial partition
(by rank, label and hash value) improves performance over
the less-refined initial partitions (by rank and label only). A
deeper look into the results of this experiment shows that this
performance improvement is due to a high increase in the
number of partition blocks in the input for the second phase.
This is as expected and does partly account for the
improvement of performance. The results also show a reduction of
the collisions on the secondary hash values that are
computed from bisimilarity families (as explained in Section 3.5);
in fact, collisions were completely eliminated.

Experiment 3 shows that the algorithm does benefit from
having more memory at its disposal. However, the impact
of an increase in available memory becomes less significant
for larger amounts of available memory.

From the results of the experiments and from the structure
of the algorithm we do not expect that certain types of DAGs
will have a much better performance than others. An
in-depth look into the running time performance shows that it
is mainly dominated by the first phase; and within this phase
the majority of time is spent on sorting and renumbering
the entire graph (last lines of Algorithm 1 and Algorithm 4).
This sorting and renumbering is unaffected by any particular
graph structures.

6. CONCLUDING REMARKS 

In this paper we have developed the first IO-efficient bisim- 
ulation partitioning algorithm for DAGs. We also devel- 
oped specializations of our general algorithm to compute 
well-known variants of bisimulation for disk-resident XML 
data. We have complemented our theoretical analysis of 
these algorithms with an empirical investigation which es- 
tablished their practicality on graphs having billions of nodes 
and edges. 

The proposed algorithms are simple enough for practical
implementation and use, for example in the design and
implementation of scalable indexing data structures to facilitate
efficient search and query answering in a wide variety of
real-world applications of DAG-structured data, as discussed in
Section 1.

Future work. The conceptual and practical results devel- 
oped here pave the way for a variety of further investigations. 
We conclude the paper with a brief discussion of some of 
these. 

Generalizing bisimulation partitioning. DAGs are adequate 
for representing XML data and other practical types of hi- 
erarchical data. However, for some applications, including 



Partition maintenance. One can expect that a practical data 
set might be subject to modifications over time. Upon mod- 
ification, it becomes necessary to update any bisimulation 
partition maintained on the data. Of course, this mainte- 
nance can be performed by throwing out the old partition 
and computing a new one from scratch. It is easy to show 
that, in the worst-case, partition maintenance can indeed 
be as expensive as calculating a new partition from scratch. 
In many practical cases, however, such a drastic approach 
is avoidable. For example, approximations of bisimulation 
which are cheaper to maintain may be acceptable.
Internal-memory approaches to incremental partition maintenance in
this spirit have been proposed. Studying
such practical maintenance schemes for disk-resident data is
another interesting direction for future research.

Practical output formatting. In the empirical validation of 
our algorithms, we have not considered any particular out- 
put format. An interesting research problem is to consider 
adapting our algorithms such that their output is usefully 
structured for some intended applications. For example, on 
XML documents one can explore the combination of our
algorithms with the on-disk data structure studied by Wang
et al. [28].

APPENDIX

A. GENERATING BENCHMARK DATA 

Developed as part of our open-source experimental
framework [15], the gen program is a benchmark graph generator.
The program can be configured to create random DAGs,
trees, chains and transitive-closure chains, with control of
basic features such as node label assignment and graph size.
We used gen to generate the input to Experiments 1-3,
discussed in Section 5. The generator does not try to represent
any particular class of graph structures, instead focusing on
the worst-case scenario of random structure, to stress-test
our algorithms.

The program uses a direct approach for generating the in- 
put for Experiment 1. First, gen creates n nodes. To each 
node v, gen assigns a label from a limited set of labels that 
depends on n. Then gen selects children to be connected to 
v by repeatedly flipping a coin that comes up heads with a 
certain probability p, that is given as a parameter to gen. 
Whenever the coin comes up heads, a new child for v is se- 
lected from the nodes that were generated before v; if the 
new child was already a child of v, it is ignored. As soon 
as the coin comes up tails, our program gen stops selecting 
children for v, and moves on to generating the next node. 

The program uses a slightly different approach for generating 
the random graphs used in Experiment 2. Again, gen creates 
n nodes and assigns labels in the way described above. To 
create edges, gen considers every pair of nodes u, v such that 
u was generated before v, and creates an edge from u to v 
with probability p. 
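The two generation schemes can be sketched as follows. This is a hypothetical Python illustration, not the gen program itself: the function names, the `num_labels` parameter (the text only says that the label set depends on n), and the fixed random seed are all assumptions of the sketch.

```python
import random


def gen_sparse(n, p, num_labels, seed=0):
    """Experiment 1 style: coin flips decide how many children a node gets.

    Requires p < 1, otherwise the coin never comes up tails.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(num_labels) for _ in range(n)]
    edges = set()
    for v in range(n):
        # Heads: select a child among the nodes generated before v.
        while v > 0 and rng.random() < p:
            edges.add((v, rng.randrange(v)))  # the set ignores repeated children
    return labels, sorted(edges)


def gen_dense(n, p, num_labels, seed=0):
    """Experiment 2 style: every pair (u, v) with u generated before v
    becomes an edge from u to v with probability p."""
    rng = random.Random(seed)
    labels = [rng.randrange(num_labels) for _ in range(n)]
    edges = [(u, v) for v in range(n) for u in range(v) if rng.random() < p]
    return labels, edges
```

In both schemes every edge connects a node to one generated at a different time, in a fixed direction, so the result is acyclic by construction. In the first scheme the number of coin flips per node is geometrically distributed, which gives an expected p / (1 - p) child selections per node before duplicates are removed.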

For Experiment 3 we use the same approach as for Experi- 
ment 1. 



